Better fix of timeout activity tick after restart#234
Better fix of timeout activity tick after restart#234jglick merged 2 commits intojenkinsci:masterfrom
timeout activity tick after restart#234Conversation
|
Thanks @jglick for quick reaction on this regression, just in case I've setup 300 executions run in test selector https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/168/ |
|
Seems it occasionally fails in Unclear if this is a bug in production code or just in the test. In any case, I am inclined to ship this amendment anyway, at least if @chwehrli @tarioch can confirm it addresses the performance regression. |
TimeoutStepTest#activityRestart still flaky, failed 3 times in 300 executions. |
|
This also looks good from our end with regards to performance issues we noticed with 226 - Jenkins has been running fine over several hours at regular load - thanks! |
Probably test changes are not needed? If I run ‘old’-versioned test against your fix the test does not flake. |
jglick
left a comment
There was a problem hiding this comment.
Just noting the reason for test changes.
|
|
||
| @Rule public GitSampleRepoRule git = new GitSampleRepoRule(); | ||
|
|
||
| @Rule public LoggerRule logging = new LoggerRule().record(TimeoutStepExecution.class, Level.FINE); |
| + " sleep 10;\n" | ||
| + " echo 'JustHere!';\n" | ||
| + " sleep 30;\n" | ||
| + " sleep 20;\n" |
There was a problem hiding this comment.
A pre-commit version the main patch contained a mistake—the build did not time out until 1½× the stated period; in this case sleep would run for ~23s before being terminated. I was trying to strengthen the test here to verify that the 15s activity timeout really applies.
| WorkflowRun b = p.scheduleBuild2(0).getStartCondition().get(); | ||
| SemaphoreStep.waitForStart("restarted/1", b); | ||
| }); | ||
| Thread.sleep(10_000); // restarting should count as activity |
There was a problem hiding this comment.
Making the test reliably reproduce the originally reported problem.
Like what I did in 512d920? OK, if that is what it takes to address the regression and prevent test flakes, then we can go with that. I just want to ship this promptly. |
#226 correctly diagnosed a problem brought to light by a flaky test, but the fix was not correct (only applied to messages printed within the controller as opposed to from an agent) and apparently introduced a severe performance regression.
I think the root issue was that restarting the controller ought to count as “activity” for purposes of delaying the timeout (better to err on the side of caution) but it did not.